INTERCONNECT MINIMIZATION IN PROCESSOR DESIGN

RELATED APPLICATION DATA

This patent application is related to the following U.S. patent
applications, commonly assigned and filed concurrently with this
application:

U.S. patent application No. 09/378,298, entitled PROGRAMMATIC
SYNTHESIS OF PROCESSOR ELEMENT ARRAYS, by Robert S. Schreiber,
Bantwal Ramakrishna Rau, Shail Aditya Gupta, Vinod Kumar Kathail,
and Sadun Anik; and

U.S. patent application No. 09/378,431, entitled FUNCTION UNIT
ALLOCATION IN PROCESSOR DESIGN, by [illegible] Robert S. Schreiber.

These patent applications are hereby incorporated by reference.

BACKGROUND OF THE INVENTIONS

1. Field of the Invention

These inventions relate to processor design, and more specifically
to interconnection of hardware components in processor design such
as for systolic processors and application specific integrated
processors (ASIPs).

2. Related Art

Processor design is a very time intensive and expensive process.
For new and unique processor designs, no automated design
techniques exist for selecting and designing the mix of processor
components or for efficiently interconnecting those components that
would be incorporated into the final processor design. While there
exist algorithms incorporated into software packages that can help
in designing new processors, such software packages do not give a
result which is a final design, let alone an optimal design.
Typically, those software packages provide approximate solutions to
a design problem, typically leading to additional design effort and
over-design to account for the lack of precision in those software
packages. Additionally, the design process may start entirely from
scratch, which would result in substantial time being consumed
analyzing possible design configurations before designing the
details of the processor. On the other hand, designing a new
processor using preexisting designs necessarily incorporates the
design benefits and flaws of the preexisting design, which may or
may not be acceptable or optimal for the new design.

All conventional processor design software packages are heuristic
in nature. In other words, they rely on design criteria and/or
methods that in the past have proven more effective than other
criteria or methods. However, in order to apply to more than one
processor design or design methodology, such design criteria and
methods must be sufficiently general to provide predictable
results. Therefore, such heuristic software packages provide
relatively high-level solutions without a complete contribution to
details of the design. Additionally, heuristic software packages
necessarily lead to significant trial and error in an attempt to
optimize the processor design. Consequently, design of new
processors is time intensive and expensive.

Processors are often designed to incorporate pipelined data paths
to speed processing throughput, reduce initiation intervals and to
optimize use of the various function units, such as adders,
multipliers, comparators, dividers, registers and the like. These
data paths are formed from an interconnected assembly of function
units and register files. The function units and register files may
be interconnected by busses. Because these data paths may include a
large number of function units, register files and bus segments,
the job of selecting the function units, register files and bus
segments is very difficult, to say the least. The task is made more
difficult if one desires to find an efficient configuration.

Pipelined data paths are particularly useful in processing
iterative instructions, such as those found in instruction loops,
and especially nested instruction loops. When considering a subset
of situations where the instruction loops are known, such as those
used with embedded processors, the task of designing the optimal,
low-cost processor still exists because of the large number of
different function units, register configurations and bus
configurations that are possible. Heuristic [illegible] to such
multi-dimensional problems. Because there are so many variables to
consider, it is too difficult to optimize all of the variables to
arrive at a suitable solution without great expenditure of time and
effort.

SUMMARY OF THE INVENTIONS

The present inventions provide methods and apparatus for
[illegible] execution units [illegible] the like. These methods and
apparatus reduce the time required [illegible] and reduce the
amount of trial and error used in processor design. They find more
efficient configurations for interconnecting function units and
registers, and they can do so much faster and more reliably than
conventional methods and apparatus. They also reduce some of the
costs associated with starting the design of a new processor from
scratch, which often may be necessary in the design of embedded
processors.

These and other aspects of the present inventions are provided by
methods and apparatus for assembling a set of hardware component
and bus assemblies, such as for an embedded processor having
pipelined data paths. The hardware components could be function
units such as adders, multipliers, arithmetic logic units,
registers and the like. The methods could be carried out on, and
the apparatus could include, any manner of equipment, such as
computers and other processors including mainframes, workstations
[illegible] the like, as well as apparatus [illegible] for use in
controlling such processors, such as disk drives, removable storage
media, and temporary storage. In one aspect of the present
inventions, the process includes identifying busses and hardware
components to which each bus is assigned for a given operation, and
identifying bus assignments for which operations occur on the same
hardware component. In the preferred embodiment, at least some of
the hardware components for which different operations occur on the
same hardware component are assigned to the same bus. With this
procedure, the process of designing interconnects between hardware
components is easier and more reliable. The resulting architecture
of the processor is improved and the layout of the bus structure is
simplified. By way of a simple example, some design techniques may
assign different busses to the same set of hardware components
where data is being transferred between the hardware components
within the set during two different cycles. The present procedure
more easily identifies such redundancy and assigns a single bus to
the set of hardware components, even though the design may treat
the data transfers as separate operations. Therefore, identifying
redundant bus structures between hardware components related by
common data transfers occurring over different cycles allows for
consolidation of those redundant bus structures.
In another preferred form of the inventions, the process includes
the steps of comparing a table of bus assignments for each of a
number of operations to be carried out over a number of known
cycles. The table may be a matrix or other representation of a
relationship between a set of busses and hardware components. The
hardware components may be function or execution units and
registers or register files, or other comparable components. In the
preferred embodiment, different matrices will represent the
relationships for different cycles. The matrices, or other
representations of the relationships, are then processed to
identify potential redundancies in the assignment of a bus to one
or more hardware components. The bus assignments are then
redistributed to reduce or eliminate the redundancies. In one
preferred embodiment, the matrices are processed two at a time to
optimize the bus assignment and interconnect configuration by
solving a conventional assignment problem. For example, the
[illegible] matrix with the transpose of a [illegible] matrix
produces a series of numbers in a correlation matrix whose diagonal
represents busses having common connections. If the columns of the
transposed matrix, representing the bus connections for that
matrix, are permuted until the cross product produces a diagonal
whose sum is a maximum, the permuted matrix producing the maximum
has a bus connection configuration which is optimum relative to the
bus configuration of the first matrix. Thus, if the busses and
hardware components are connected in accordance with the first
matrix and the permuted [illegible] matrix for the two cycles
represented by [illegible] the two original matrices.

In a further preferred form of the invention, correlation matrices
are produced for multiple combinations of pairs of the matrices so
that each of the original matrices can be permuted to produce a
more efficient bus connection configuration. In a particular
preferred embodiment, the first matrix and the permuted matrix are
OR-ed together to produce an accumulation matrix, which is then
used to produce a correlation matrix with a third transposed
matrix. The correlation matrix is again used to find a permuted
matrix which optimizes the sum of the numbers on the diagonal of
the correlation matrix, thereby determining the optimum bus
interconnection configuration for the third matrix relative to the
accumulation matrix. The optimized third matrix is then OR-ed with
the accumulation matrix to produce a new accumulation matrix. Each
of the individual optimized, permuted matrices is stored. The
process is repeated until all of the individual matrices have been
[illegible] against which the permuted matrix was optimized. These
permuted matrices produce a bus interconnect configuration that is
more efficient than the starting configurations.

In another form of the inventions, a second sweep of the matrices
representing the bus interconnect configurations is carried out so
that each of the matrices can be optimized relative to all of the
others. In the first sweep, the second matrix was permuted and
optimized relative to the first matrix, but not any of the others.
(The last matrix is permuted and optimized relative to all of the
preceding matrices.) In the second sweep, each matrix will be
permuted and optimized in view of all the other matrices.
Specifically, a new accumulation matrix will be prepared by OR-ing
all of the permuted matrices together except for the first matrix.
The new accumulation matrix will then be used to optimize the first
matrix. The optimized permuted matrix will then be stored. The
other permuted matrices will also be optimized relative to a new
accumulation matrix prepared by OR-ing all the permuted matrices
but the one to be optimized. At the end of the second sweep, the
newly permuted matrices are then stored and used to create the bus
interconnect configurations. Alternatively, [illegible] matrix can
be prepared by OR-ing [illegible]. The final accumulation matrix
represents the bus interconnects with the hardware components.

In [illegible] preferred form of the invention, times are
identified for each bus in the plurality of busses when the bus
[illegible] same hardware components at different times into a
single bus. In this manner, different busses that may be assigned
to the same hardware components, but at different cycles of a
process, can be merged into a single bus. Consequently, data
transfers occurring at different times but between the same
hardware components can occur on the same bus.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic and assembly drawing of an [illegible] for
hardware components for a processor such as an embedded systolic
processor, including apparatus for receiving [illegible] delivering
results from the [illegible] process.

FIG. 2 is a schematic diagram of part of a processor such as a
pipelined processor [illegible] components and a [illegible]
designed in accordance with the apparatus and methods of the
present inventions.

FIG. 3 is a schematic of a sample software routine segment that may
be executed in a processor designed using designs and methods of
the present inventions.

FIG. 4 is a schematic of a machine instruction level sample routine
derived from the software routine of FIG. 3.

FIG. 5 is a table showing a hypothetical bus-data transfer
relationship over several cycles of a computation.

FIG. 6 is a table showing a hypothetical rearrangement of data
transfers for the busses and for the several cycles depicted in
FIG. 5.

FIGS. 7A-7C are tables depicting bus and hardware component
interconnections as a function of the cycle time relative to an
arbitrary cycle start and having several redundancies.

FIGS. 8A-8C are tables depicting bus and hardware component
interconnections as a function of the cycle time relative to an
arbitrary cycle start after some permutations to account for the
redundancies depicted in FIGS. 7A-7C.

FIGS. 9A-9D are matrices depicting the bus and hardware component
interconnections represented in the tables of FIGS. 7A-7C.

FIG. 10 is a transpose matrix of the matrix of FIG. 9B to be used
in an optimization procedure in accordance with one aspect of the
present inventions.

FIG. 11 is a correlation matrix of the matrix of FIG. 9A and the
transpose of FIG. 10.

FIGS. 12A and 12B are the matrix of FIG. 9A and a permuted
transpose of the matrix of FIG. 9B, respectively.

FIG. 13 is a correlation matrix of the matrices of FIGS. 12A and
12B showing an increased sum of the diagonal of the correlation
matrix upon permuting the matrix A1.

FIG. 14 is an accumulation matrix from the matrices of FIGS. 12A
and 12B.
FIG. 15 is a flow chart representing a method or process for
reducing the number of interconnects between one or more busses and
hardware components and using the information to develop a hardware
description language.

FIG. 16 is a flow chart representing one method or process for the
reducing process depicted in FIG. 15.

FIG. 17 is a flow chart representing one method for optimizing bus
interconnects by solving an assignment problem in accordance with
one aspect of the present inventions.

FIG. 18 is a flow chart representing a further method for
optimizing bus interconnects in accordance with a further aspect of
the present inventions.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The inventions, some of which are summarized above, and defined by
the enumerated claims may be better understood by referring to the
following detailed description, which should be read in conjunction
with the accompanying drawings. This detailed description of a
particular preferred embodiment, set out below to enable one to
build and use one particular implementation of the invention, is
not intended to limit the enumerated claims, but to serve as a
particular example thereof. The particular examples set out below
are the preferred specific implementations of the interconnect
system that can be used for a number of applications and
implemented in a number of ways. The inventions, however, may also
be applied to other systems as well.

In accordance with several aspects of the present inventions,
apparatus and methods are disclosed for selecting hardware
components for a processor such as an embedded processor that
decrease the time and effort required for processor design. The
methods and apparatus also reduce the amount of trial and error
used during processor design, and produce a more predictable and
definitive result than general heuristic methods. They also improve
the architecture of the processor and free up space on the layout
of the processor and reduce the cost of the end product.
Consequently, these methods and apparatus significantly reduce the
cost of design.

In one aspect of the present inventions, the methods can be carried
out on a pre-programmed digital processor, a general-purpose
computer or a workstation, such as at 20 (FIG. 1), which can
receive input from a conventional keyboard 22. The input can take
the form of input data for use by algorithms processed by the
computer 20, or other input or information as necessary.
Information such as intermediate results, queries, requests for
input, final results or other information can be displayed by the
computer on a monitor 24 or output, as desired.

The computer 20 can receive applications programs and/or data from
a number of different sources, including a removable storage device
such as disk 26 for a disk drive 27 or any other movable tangible
storage medium, such as portable disk drives and the like.
Applications programs and data can also be received from a network
host, host processor or mainframe computer 28, or from more remote
locations such as off-site servers, over the Internet, or through
other conventional communications paths. For example, data can be
received from a satellite antenna 30 through a transmitter receiver
32 linked to an input and output port on the computer 20. Other
linkages and communications methods can be used with the computer
20 in order to receive data and software, and to transmit data,
results and software.

In the following discussion, a number of methods and systems are
described that are implemented in programs. The term "programmatic"
refers to a system or method embodied in a program or programs
implemented in software executed on a computer, in hardwired
circuits, or a combination of hardware and software. In the current
implementation, the programs as well as the input and output data
structures are implemented in [illegible] stored on the
workstation's memory system. The programs and data structures may
be implemented using standard programming languages, and ported to
a variety of computer systems having differing processor and memory
architectures. In general, these memory architectures are referred
to as computer readable media.

Particular applications to which the present inventions are
directed include designing processors such as embedded processors.
Embedded processors are used extensively as controllers or
processors for equipment, appliances, entertainment devices, and
the like. These processors have predefined functions and
operations, and many of the operations are repetitive. These
repetitive operations, by their nature, lend themselves to being
carried out on function units and registers depicted schematically
at 34 (FIG. 2) all arranged in such a way as to move incoming data
and transfer the results of operations along a path termed a
"pipeline". The pipeline arrangement takes advantage of the natural
flow of data through the operations while minimizing register load
requests and data transfers. Busses 36 may be used to interconnect
the function units and the register files 34 through junctions 38.
The present inventions can be used to quickly and inexpensively
optimize part of the design of such embedded processors. While the
following discussion will focus on embedded processors and their
design, with particular emphasis on the interconnection of function
units and register files with busses, configured to optimize data
pipelining, it should be understood that the inventions described
herein may be applicable to the design and manufacture of other
processor configurations.

Processors operate based on instructions from a software program or
other instruction source. Part of that software program or other
instruction source may include a loop body or loop nest represented
in FIG. 3 by the sample routine 40. This sample routine would be
part of a larger software program or other instruction source, but
serves in this instance as a suitable example of a set of
calculations that can be carried out on an embedded processor, such
as one including an ensemble of function units, register files and
possibly busses forming a pipelined data path. The sample routine
may include an add operation 42, a multiply operation 44 and an add
operation 46 within a loop 48, which in turn is nested within a
loop 50. This loop nest can be evaluated and considered along with
other requirements to determine possible Initiation Intervals (II)
according to which the operations would be carried out on various
function units. Initiation Interval may take into account the cycle
time required by the equipment or appliance into which the
computing unit will be placed. For example, the Initiation Interval
may depend at a macro level on the number of pages to be scanned
per second in a scanner, the number of sheets to be output per
minute by a printer or copier, and the like. It also may depend on
a micro level on such factors as recurrences and the like. Using
this information, the mix of function units and registers will be
determined to carry out the instructions represented by the code,
and busses assigned to the function units and registers to permit
data transfers. Given an ensemble of function units, register
files, and busses, the interconnections of those
units will then be determined to preferably minimize the
number of individual busses, and to optimize the intercon- 
nections between the function units, register files and busses 
while still carrying out the required data transfers. If the 
interconnections can be optimized, the cost of manufactur- 
ing the processor can be reduced by avoiding inefficient 
connections, and possibly by reducing the number or com- 
plexity of the bus configurations. One preferred approach for 
doing so is described herein. 

By way of example, a simple operation will be described 
to demonstrate the development of a schedule of operations 
for a loop body, including data transfers, as a function of 
clock cycles at which they occur. A sample routine includes 
a number of operations, as shown in FIG. 3, and in this 
example includes several loops over which the operations 
will occur. Within a loop, it can be seen that the operations 
will repeat, especially in pipelined data paths. An initiation
interval can be defined as the number of cycles between the
start of an operation in one loop iteration and the start of
the same operation in the next iteration, in other words the
number of cycles between successive iterations of the loop.
The sample routine can be converted into
an equivalent sample routine at the machine instruction level 
depicted as 52 in FIG. 4. The instruction level sample 
routine 52 identifies the cycle times over which the opera- 
tions of the inner loop nest 48, including data transfers, 
could be carried out on possible function units, such as an 
adder 54 and a multiplier 56. Because data transfers repre- 
sent a step in the operations, and require a cycle time to be 
assigned to each data transfer, certain cycles include data 
transfers between hardware components, such as the Fetch 
Data operations at Cycle 1, and the Write Data operation at 
Cycle 6. Specifically, the Fetch Data from Register 1 for the 
hardware component Adder 54 uses a data transfer on a bus 
(such as B3 in FIG. 2) connected between the register and 
the adder, both function units in FIG. 2. This connection 
occurs at Cycle 1. Another connection occurs at Cycle 1 as 
well, namely for the transfer between register 7 and the 
multiplier 56, which may occur on bus B2. Other connec- 
tions are scheduled at Cycles 4 and 6. Once the operations 
are analyzed and the relationships between the function units 
and the registers are defined, busses or other connections 
will be made to allow the data transfers to occur at the 
desired cycle time. It is desired to minimize the number of 
the connections used to achieve these data transfers, because 
each connection made in the hardware has a cost not only in 
terms of creation but also in terms of real estate taken up by 
the bus and its connections. 
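
As a rough illustration of the bookkeeping described above, the following Python sketch folds a list of scheduled data transfers into one bus-by-component connection table per cycle of the initiation interval, then counts the distinct connections that would have to be built. The function names, the component labels, the initiation interval of 6 and the transfers at Cycles 4 and 6 are assumptions made for the example; only the two Cycle 1 transfers (Register 1 to the adder on B3, register 7 to the multiplier on B2) are taken from the description.

```python
def build_cycle_tables(transfers, components, num_busses, ii):
    """Return one bus-by-component 0/1 table per cycle of the initiation interval.

    transfers: iterable of (cycle, bus_index, (component, component, ...)).
    Cycles are folded modulo ii, since the schedule repeats every ii cycles.
    """
    column = {name: j for j, name in enumerate(components)}
    tables = [[[0] * len(components) for _ in range(num_busses)]
              for _ in range(ii)]
    for cycle, bus, endpoints in transfers:
        for name in endpoints:
            tables[cycle % ii][bus][column[name]] = 1
    return tables


def connection_count(tables):
    """Count distinct (bus, component) connections needed over all cycles."""
    num_busses, num_components = len(tables[0]), len(tables[0][0])
    return sum(
        any(t[b][c] for t in tables)
        for b in range(num_busses)
        for c in range(num_components)
    )


# Hypothetical data: busses B1..B3 are indices 0..2.
COMPONENTS = ["R1", "R7", "ADD", "MUL"]
TRANSFERS = [
    (1, 2, ("R1", "ADD")),   # Cycle 1: Register 1 to adder on B3 (from the text)
    (1, 1, ("R7", "MUL")),   # Cycle 1: register 7 to multiplier on B2 (from the text)
    (4, 0, ("ADD", "MUL")),  # Cycle 4: assumed transfer on B1
    (6, 0, ("MUL", "R1")),   # Cycle 6: assumed write-back on B1
]

tables = build_cycle_tables(TRANSFERS, COMPONENTS, num_busses=3, ii=6)
print(connection_count(tables))
```

Each table row is a bus and each column a hardware component, so OR-ing the tables across cycles gives the set of physical connections; the count is the quantity the interconnect optimization tries to reduce.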

Given an instruction level sample routine 52 and an 
initiation interval II, there are numerous interconnections of
busses and hardware components that could be used to make 
the data transfers defined by the sample routine 40 and 
instruction level sample routine 52. In accordance with one 
aspect of the present inventions, an assignment problem 
algorithm 58 (FIG. 1) is used to find a solution to the 
optimization problem of identifying the desired intercon- 
nections between hardware components and busses to 
achieve the data transfers defined by the instruction level 
sample routine. The solution to the interconnect design 
problem is a series or group of definitions for the intercon- 
nections between the busses and the hardware components 
either on a cycle-by-cycle basis or as a function of the busses 
and hardware components to which the busses are con- 
nected. This information is then used with the hardware 
definitions and assignments to create a hardware description 
language for the processor. 
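
The pairwise step of such an assignment problem can be sketched in Python as follows. Two per-cycle tables (rows are busses, columns are hardware components) are multiplied, one against the transpose of the other, to give a bus-by-bus correlation matrix, and the busses of the second table are relabeled so that the diagonal sum is as large as possible. This is a minimal sketch under stated assumptions: the function names and the sample tables are hypothetical, and the exhaustive search over permutations is a stand-in chosen for brevity, not the solver used by algorithm 58. For more than a handful of busses, a polynomial-time assignment solver (for example the Hungarian method, or `scipy.optimize.linear_sum_assignment` with `maximize=True`) would be the practical choice.

```python
from itertools import permutations


def correlation(first, second):
    """C[i][j] = number of components shared by bus i of `first` and bus j of `second`."""
    return [[sum(x & y for x, y in zip(row_a, row_b)) for row_b in second]
            for row_a in first]


def best_relabeling(first, second):
    """Return (perm, score); perm[i] is the bus of `second` to move onto bus i."""
    c = correlation(first, second)
    n = len(c)

    def diagonal_sum(perm):
        return sum(c[i][perm[i]] for i in range(n))

    # Exhaustive search: acceptable only for very small bus counts.
    best = max(permutations(range(n)), key=diagonal_sum)
    return list(best), diagonal_sum(best)


def relabel(table, perm):
    """Reorder the bus rows of `table` according to `perm`."""
    return [table[j][:] for j in perm]


# Hypothetical tables: 4 busses (rows) by 5 components U1..U5 (columns).
first = [[1, 0, 0, 1, 1],   # bus 0 serves U1, U4, U5
         [0, 0, 0, 0, 0],
         [0, 1, 1, 0, 0],   # bus 2 serves U2, U3
         [0, 0, 0, 0, 0]]
second = [[0, 0, 0, 0, 0],
          [0, 1, 1, 0, 0],  # bus 1 serves U2, U3
          [0, 0, 0, 0, 0],
          [0, 0, 0, 1, 1]]  # bus 3 serves U4, U5

perm, score = best_relabeling(first, second)
relabeled = relabel(second, perm)
print(perm, score)
```

With these tables the transfers of the second cycle land on busses that are already wired to the same components in the first cycle, so no new connections are needed for them.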

In one preferred embodiment of the inventions, a general-
purpose computer, workstation or other processor 20 (FIG.
1) may accept as input an instruction program and parameter
set 60 (FIG. 15). This information may be input from a
keyboard 22, a removable storage medium 26, from a
mainframe 28 or from a communications link. The instruc-
tion program and parameter set may include the initiation
interval or a similar criterion, and a library of the function
units and busses or other components that have been
scheduled, for example with a modulo scheduler, and
assigned to be included in the final processor. The processor
20 will also include, already loaded or retrieved from
another source, an algorithm for matrix manipulation and an
algorithm for solving an assignment problem for optimizing
the interconnects between the busses and the hardware
components. These algorithms may be separate routines or
a series of routines or may be a single program for carrying
out the methods set forth herein. The processor 20 then
creates matrices of the bus and hardware data transfer
assignments in step 62 using the instruction program and
parameter set and definitions provided from step 60. In a first
pass, the matrices are optimized using the assignment prob-
lem algorithm to minimize to a first approximation the
interconnects between the busses and the hardware compo-
nents (function units and register files) at step 64. In one
preferred embodiment, a second pass then further optimizes
the interconnects between the busses and the hardware
components using the assignment problem algorithm at step
66. Two passes may be optimum, if further passes do not
lead to significant optimization. However, it is preferred to
do as many passes as can be reasonably done while the
results provide further optimization. What is reasonable will
depend on the processing costs for doing additional passes
compared to the incremental improvements obtained.
Moreover, the time savings provided by this method may
permit a relatively large number of passes while overall still
using up less time in this part of the design phase
compared to conventional techniques. Following the
optimization, the bus interconnects are assigned and a set of
interconnect definitions is generated at step 68 and the
definitions along with the hardware definitions are output at
step 70.
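
The multi-pass flow of steps 64 and 66 can be sketched in Python as below. In the first pass each per-cycle table is relabeled against the OR of the tables already processed; in each later pass every table is relabeled against the OR of all the others, and passes stop once the connection count no longer improves. This is a sketch under stated assumptions rather than the implementation: all function names, the `max_passes` limit, the stopping rule and the sample data are hypothetical, and an exhaustive permutation search stands in for the assignment problem algorithm.

```python
from itertools import permutations


def make_table(num_busses, num_components, connections):
    """Build a 0/1 table from {bus_index: [component_index, ...]}."""
    table = [[0] * num_components for _ in range(num_busses)]
    for bus, comps in connections.items():
        for c in comps:
            table[bus][c] = 1
    return table


def or_tables(tables, num_busses, num_components):
    """Accumulation table: element-wise OR of the given tables."""
    acc = [[0] * num_components for _ in range(num_busses)]
    for t in tables:
        for b in range(num_busses):
            for c in range(num_components):
                acc[b][c] |= t[b][c]
    return acc


def relabel_against(acc, table):
    """Permute the bus rows of `table` to maximize overlap with `acc`."""
    n = len(table)

    def overlap(perm):
        return sum(x & y for i in range(n)
                   for x, y in zip(acc[i], table[perm[i]]))

    best = max(permutations(range(n)), key=overlap)  # brute force, small n only
    return [table[j][:] for j in best]


def connection_count(tables):
    """Number of distinct (bus, component) connections over all cycles."""
    return sum(map(sum, or_tables(tables, len(tables[0]), len(tables[0][0]))))


def optimize(tables, max_passes=4):
    tables = [[row[:] for row in t] for t in tables]
    nb, nc = len(tables[0]), len(tables[0][0])
    # First pass: each table against the accumulation of its predecessors.
    for k in range(1, len(tables)):
        tables[k] = relabel_against(or_tables(tables[:k], nb, nc), tables[k])
    best = connection_count(tables)
    # Later passes: each table against the accumulation of all the others.
    for _ in range(max_passes - 1):
        for k in range(len(tables)):
            others = tables[:k] + tables[k + 1:]
            tables[k] = relabel_against(or_tables(others, nb, nc), tables[k])
        current = connection_count(tables)
        if current >= best:
            break  # no further improvement, so stop iterating
        best = current
    return tables


# Hypothetical schedule: 6 busses, components U1..U5 as columns 0..4.
cycles = [
    make_table(6, 5, {0: [0, 3, 4], 2: [1, 2]}),  # first cycle
    make_table(6, 5, {4: [1, 3, 4]}),             # second cycle
    make_table(6, 5, {5: [1, 3, 4]}),             # last cycle
]
print(connection_count(cycles), connection_count(optimize(cycles)))
```

In this made-up schedule the transfers of the later cycles are folded onto the first bus, which already reaches two of the three components involved, so eleven connections shrink to six.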

Thereafter, the processor 20 or another suitable apparatus 
generates a hardware description in a standard hardware 
description language at step 72, such as VHDL, and outputs 
the hardware description language so that it can be 
delivered, at step 74, to a manufacturing site. The manufac- 
turing site then creates at step 76 a suitable product such as 
an embedded processor 78 which incorporates the function 
units and bus interconnects between the function units and 
register files, suitable to execute the program for which it 
was designed. 

Considering the methods and apparatus of the present 
inventions in more detail, it will be assumed for the present 
discussion that a loop body and an initiation interval have 
been defined, such as by coded instructions or other means
for defining the operations to be carried out on the processor
being designed. Once the function units have been allocated
and scheduled, such as with a modulo scheduler, every
required operation including data transfers has been
assigned a start time or cycle number relative to the start
time of an iteration. Busses have also been allocated for the
scheduled data transfers corresponding to the function units
or hardware components scheduled by the modulo sched-
uler. Therefore, for each cycle within an initiation interval,
all of the operations have been scheduled. Some of these
operations include data transfers, assigned to the number of
busses necessary to carry out the data transfers during the
given cycle, and there will typically be data transfers for
each cycle within the initiation interval. It then remains to
determine, for every cycle, which bus will handle which data
transfer, in other words what connections are required between any
given bus and the ensemble of hardware components to be included in
the processor. Because there is a cost associated with the building
of a bus and the connections made between the bus and a given
hardware component, it is desirable to minimize the number of
connections to a bus, and to minimize the number of busses.

It should be understood that several outer boundaries apply to the
number of busses and the number of connections to the busses. For
example, there will be a minimum number of discrete busses used,
which number is determined by the largest number of data transfers
taking place within the given initiation interval. For other cycles
within the initiation interval, it is possible that not all of the
busses will be used if fewer than the maximum number of data
transfers are occurring during the other cycles. If B is the
minimum number of discrete busses required, B! (B factorial)
represents the number of choices for connections to the busses for
each cycle, assuming that the first cycle within a given initiation
interval has a defined connection assignment. Additionally, the
number of possible bus connections is given by B! raised to the
(II-1) power. Here, II represents the initiation interval, or the
number of cycles that occur before an iteration starts again.
Consequently, there are a large number of possible interconnect
combinations for the busses, and it is desirable to have an
efficient method for deciding how to interconnect the busses and
the hardware components in order to carry out the operations
defined by the program code, or other instructions to be processed
by the processor.

While it is possible that all the function units and register files
can be connected to all of the busses, thereby eliminating any need
to decide how to connect the busses, such a combination is too
expensive to manufacture and creates an inefficient processor
design. Alternatively, data transfers

mined. For example, the data transfers generated by a modulo
scheduler for a given program code and ensemble of function units
and generic busses can be depicted by a truth table or function
unit-bus array. An example of such an array is provided in FIGS.
7A-7C for a first Cycle 0, a second Cycle 1 and on out to the last
Cycle II-1, represented in FIG. 7C. In the first table 84 for Cycle
0, the modulo scheduler has assigned data transfers to occur on Bus
1 between function units U1, U4 and U5. Other data transfers may be
scheduled for Cycle 0, such as represented for Bus 3 between
function units U2 and U3. No other data transfers are shown as
being scheduled for function units U1-U5, but it should be
understood that other data transfers could be scheduled during
Cycle 0 for the other busses for function units U6-Un, where "n" is
the total number of hardware components. When considering the data
transfers scheduled for the remaining cycles, it can be seen that
some of the same hardware components have also been connected to
other busses, as indicated by the solid dots corresponding to other
bus numbers. For example, during Cycle 1, data transfers occur
between units U4 and U5 and a Bus 5, as shown in the array 86,
which data transfer would ordinarily require a connection between
them. Similarly, data transfers will occur between those same units
on a Bus 6 during the last cycle, Cycle II-1, as shown in array 88.
If the scheduling can modify or reassign the data transfer to occur
not on Bus 5 but instead on Bus 1 for Cycle 1, and if the data
transfer currently scheduled for Bus 6 in Cycle II-1 can be
modified or reassigned to be on Bus 1, data transfers between
hardware components U2, U4 and U5 can be accommodated by Bus 1 for
those later cycles. Units U4 and U5 will have already been
connected to Bus 1, and a connection for U2 can be added to
accommodate the data transfers for hardware component U2 in Cycles
1 and II-1. In doing so, five redundant bus interconnections may
have been eliminated. Therefore, it is desirable to minimize the
number of connections between the different busses and the hardware
can be assigned to busses according to which data transfers components in order to minimize the cost inherent in each 
are required for a given cycle using a simple sequential connection. 

assignment 80 (FIG. 5). For example, during Cycle 0, data 40 The arrays 84-88 shown in FIGS. 7A-7C graphically 

transfer 1 (DTI) is assigned to bus 1 (Bl), data transfer 2 demonstrate possible redundancies in data transfer assign - 

(DT2) is assigned to bus 2 (B2) and data transfer 3 (DT3) is ments between busses and hardware components. The arrays 

assigned to bus 3 (B3). This assignment sequence continues, also can be used to visualize possible ways to minimize 

for example if the next data transfer (DT4) occurs during redundancies in those assignments. In one preferred form of 

Cycle 1, DT4 is assigned to Bl, and so on. This simple 45 the inventions, the data transfer operations, represented by 

approach is not very efficient, especially if data transfer 1 the bus and hardware component relationships depicted in 

occurring over bus 1 is between the same hardware com- FIGS. 7A-7C, are analyzed to identify bus assignments for 

ponents as data transfer 6. With the assignment represented which different operations occur on the same hardware 

at 80 in FIG. 5, the connections to bus 3 for data transfer 6 components. For example, it can be seen in FIGS. 7A-7C 

may not be necessary and can be handled by the connections 50 that hardware components U4 and U5 have data transfer 

already established for data transfer 1 on bus 1, and bus 3 operations in different cycles with different bus assignments, 

may never have to be connected to the hardware components These are considered different operations since they occur 

to which bus 1 is already assigned. For example, analysis during different cycles. 

may show that data transfer 6 to already be accommodated These data transfer interconnects are candidates for 

by the connections already established for data transfer 1 on 55 optimization, for example with the data transfer intercon- 

bus 1 during Cycle 0, and that data transfer 4 is already nects to be made with those hardware components on Bus 1 

accommodated by connections to bus 3 already established to achieve the data transfers scheduled for Cycle 0. 

for data transfer 3 during Cycle 0, as represented at 82 in Preferably, these data transfer operations for hardware com- 

FIG. 6. In summary, once a set of data transfers are assigned ponents U4 and U5 during the different cycles are assigned 

to respective busses, such as for Cycle 0, or any other Cycle 60 to the same bus, for example Bus 1, as depicted at 92 in FIG. 

chosen first, it is preferable to take advantage of the existing 8B. With such a re- assignment, connections will be made 

assignment when assigning data transfers to busses in other between Bus 1 and hardware components Ul, U2, U4 and 

cycles. U5. The connections between Bus 1 and Ul will permit the 

In one preferred embodiment, it is assumed that the data transfer depicted in array 90 in FIG. 8A. The connec- 

number of busses and the number of hardware components 65 tions between Bus 1 and U2 will permit the data transfers 

have already been selected, and the connections between the depicted in arrays 92 and 94 in FIGS. 8B and 8C, respec- 

busses and the hardware components remain to be deter- tively. The connections between Bus 1 and U4 and U5 will 
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permit the the data transfers on Bus 1 during Cycle 0, Bus 
5 during Cycle 1 and on Bus 6 during Cycle II -1. Also with 
such a re- assignment, the busses in Cycle 1 and Cycle II— 1 
are re-arranged to accomodate the new Bus 1. (Note that no 
dots are depicted in FIGS. 8B or 8C for a data transfer on 5 
Bus 1 for Ul since other data transfers are scheduled for Ul 
during those cycles.) 
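
Two figures from the discussion above can be checked with a little arithmetic: the size of the search space (B! raised to the (II-1) power) and the five interconnections saved by moving the U2/U4/U5 transfers onto Bus 1. The sketch below is illustrative only. The function names, the choice of B = 8 and II = 3, the use of cycle 2 to stand in for Cycle II-1, and the presence of U2 on Bus 5 and Bus 6 are all assumptions inferred from the text, not data taken from the figures.

```python
from math import factorial


def search_space_size(num_busses, initiation_interval):
    """B! raised to the (II - 1) power: bus orderings for every cycle but the first."""
    return factorial(num_busses) ** (initiation_interval - 1)


# Hypothetical schedule: cycle -> {bus number: units using that bus}.
# Cycle 2 stands in for Cycle II-1; U2 on Bus 5 and Bus 6 is an assumption.
BEFORE = {
    0: {1: {"U1", "U4", "U5"}, 3: {"U2", "U3"}},
    1: {5: {"U2", "U4", "U5"}},
    2: {6: {"U2", "U4", "U5"}},
}


def reassign(schedule, moves):
    """Copy the schedule, moving transfers as given by {(cycle, old_bus): new_bus}."""
    result = {}
    for cycle, busses in schedule.items():
        moved = {}
        for bus, units in busses.items():
            new_bus = moves.get((cycle, bus), bus)
            if new_bus in moved:
                # A bus can carry only one transfer per cycle.
                raise ValueError(f"bus {new_bus} used twice in cycle {cycle}")
            moved[new_bus] = set(units)
        result[cycle] = moved
    return result


def connections(schedule):
    """Distinct physical (bus, unit) connections needed over all cycles."""
    return {
        (bus, unit)
        for busses in schedule.values()
        for bus, units in busses.items()
        for unit in units
    }


# Move the Cycle 1 transfer from Bus 5 and the last-cycle transfer from Bus 6 to Bus 1.
AFTER = reassign(BEFORE, {(1, 5): 1, (2, 6): 1})

if __name__ == "__main__":
    print(search_space_size(8, 3))                          # 1625702400
    print(len(connections(BEFORE)), len(connections(AFTER)))  # 11 6
```

Under these assumptions the connection count drops from 11 to 6, which agrees with the five redundant interconnections mentioned in the text.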

Steps that can be followed to minimize the number of bus
interconnects as described with respect to FIGS. 7A-7C include
identifying each bus in a plurality of busses at step 96 in FIG.
16. The identification of the busses can be developed from the
hardware and bus assignment information input to the processor
20. The hardware and bus assignment information may typically be
provided from a scheduling algorithm such as a modulo scheduler,
and represents the hardware components and bus components to be
used in the processor to carry out the operations defined by the
program code or other instruction language. The busses may be
given a designation or representation for purposes of
identification. For discussion purposes, the busses for each
cycle represented by the arrays in FIGS. 7A-7C are designated
Bus 1 and up for a total of B busses. Only eight busses are
depicted in FIGS. 7A-7C.

For each bus, at least one hardware component is identified
corresponding to each bus for a given operation, at step 98
(FIG. 16). If, for all cycles, a bus does not have a hardware
component assigned, the scheduler has assigned too many busses.
In the arrays of FIGS. 7A-7C, the identification of hardware
components corresponding to a data transfer on a bus is
indicated by the black dots 100. Each black dot corresponds to a
single hardware component U and a given bus B for a given Cycle.
A table is prepared, step 102, representing the bus assignments,
and may take the form of the arrays depicted in FIGS. 7A-7C.
However, in the preferred embodiment, the table of bus
assignments is represented by a plurality of matrices, each
matrix representing the assignments for a given cycle. The
matrix representations are discussed more fully below. Using the
table of bus assignments, at least two bus assignments are
identified for which "different operations" occur on the same
hardware component, step 104. These bus assignments are
candidates for optimizing the interconnects between the busses
and the corresponding hardware components. For example, where
data transfers are scheduled for the same hardware components,
such as U4 and U5 in FIGS. 7A-7C, and for different bus numbers,
and during different cycles or at different times, the data
transfers for hardware components U4 and U5 can all be assigned
to the same bus, namely B1. The assignment of at least some of
the hardware components to the same bus is represented at step
106 (FIG. 16). As a result, fewer bus interconnects will be made
in the final processor than would otherwise have occurred
without these optimization steps.
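
Steps 98 and 104 can be sketched as two small queries over a schedule. This is a minimal illustration, not the patented implementation: the schedule layout, the function names `unused_busses` and `find_candidates`, and the sample data (cycle 2 standing in for Cycle II-1, with U2 on Bus 5 and Bus 6) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical schedule: cycle -> {bus number: units using that bus}.
SCHEDULE = {
    0: {1: {"U1", "U4", "U5"}, 3: {"U2", "U3"}},
    1: {5: {"U2", "U4", "U5"}},
    2: {6: {"U2", "U4", "U5"}},
}


def unused_busses(schedule, num_busses):
    """Busses with no hardware component in any cycle (step 98 sanity check)."""
    used = {bus for busses in schedule.values() for bus in busses}
    return [bus for bus in range(1, num_busses + 1) if bus not in used]


def find_candidates(schedule):
    """Units whose transfers land on more than one bus across cycles (step 104).

    Returns {unit: sorted list of (cycle, bus) pairs}.
    """
    usage = defaultdict(set)
    for cycle, busses in schedule.items():
        for bus, units in busses.items():
            for unit in units:
                usage[unit].add((cycle, bus))
    return {
        unit: sorted(pairs)
        for unit, pairs in usage.items()
        if len({bus for _, bus in pairs}) > 1
    }


if __name__ == "__main__":
    for unit, pairs in sorted(find_candidates(SCHEDULE).items()):
        print(unit, pairs)
```

With this sample data, U2, U4 and U5 are reported as candidates, while U1 and U3 each touch only one bus and are left alone.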

The phrase "different operations" refers to functional
operations occurring at different times for the same or
different hardware components. "Operation" is a functional
operation at a distinct time. Therefore, a data transfer from a
given hardware component at a given time will be an operation
different from the data transfer from the same hardware
component at a different time. The same holds true for transfers
to the same hardware component at different times. Data
transfers from different hardware components, or to different
hardware components, at the same time will be different
operations. Data transfers between hardware components at a
given time are the same operation. In the preferred embodiment,
at least two bus assignments are identified for which different
operations occur on the same hardware component. Typically,
these operations occur at different times, such as during
different cycles. Consequently, no interference would occur for
the scheduling of the operations on the same bus.

In one preferred embodiment of the inventions, the
identification of busses and hardware components to which the
busses are assigned for a given operation is achieved through
the use of a table or other representation of the assignment,
generated by a scheduler or other operator, between a bus and a
given hardware component. The table or other representation is
preferably any form which can be manipulated or processed with a
computer such as processor 20 to automatically carry out the
optimization of the bus interconnects. In one preferred
embodiment of the inventions, the table of bus assignments is
represented as a group of matrices such as are depicted in FIGS.
9A-9D. Each matrix is preferably a B×U matrix where B is the
total number of busses assigned, and U is the total number of
hardware components, such as function units and register files,
to be incorporated into the processor being designed. Each
matrix will be designated by the letter A with a subscript
designating the cycle to which the matrix corresponds. Each
matrix element is either a 1 or a 0, a 1 indicating an
assignment by the scheduler of the bus designated by the row to
the hardware unit designated by the column. Therefore, the 1
(108) in the first row in the first column of matrix A_0
indicates that the scheduler has assigned a data transfer to
occur at Cycle 0 on bus B1 with hardware component U1. A 0
indicates that no data transfer has been assigned for the bus
and the hardware component corresponding to the row and column,
respectively. Matrix A_0 in FIG. 9A corresponds to the array 84
in FIG. 7A. Matrix A_1 corresponds to the array 86 in FIG. 7B
and matrix A_{II-1} corresponds to the array 88 in FIG. 7C. No
array is given which corresponds to matrix A_j.
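
As an illustration of the matrix form just described, the sketch below turns a per-cycle schedule into one B×U matrix of 0s and 1s per cycle, with rows for busses and columns for hardware components. The function name `assignment_matrices`, the schedule layout and the sample data are assumptions for the example; only the Cycle 0 entries (Bus 1 with U1, U4, U5 and Bus 3 with U2, U3) come directly from the text.

```python
# Hypothetical schedule: cycle -> {bus number: units using that bus}.
SCHEDULE = {
    0: {1: {"U1", "U4", "U5"}, 3: {"U2", "U3"}},
    1: {5: {"U2", "U4", "U5"}},
    2: {6: {"U2", "U4", "U5"}},  # stands in for Cycle II-1
}
UNITS = ["U1", "U2", "U3", "U4", "U5"]


def assignment_matrices(schedule, num_busses, units):
    """Build one B x U matrix per cycle; entry [b][u] is 1 if bus b+1 serves unit u."""
    column = {unit: j for j, unit in enumerate(units)}
    matrices = []
    for cycle in sorted(schedule):
        matrix = [[0] * len(units) for _ in range(num_busses)]
        for bus, assigned in schedule[cycle].items():
            for unit in assigned:
                matrix[bus - 1][column[unit]] = 1  # bus numbers start at 1
        matrices.append(matrix)
    return matrices


A = assignment_matrices(SCHEDULE, 8, UNITS)

if __name__ == "__main__":
    for row in A[0]:
        print(row)
```

Here `A[0][0][0]` is 1, matching the statement that the first row and first column of A_0 records a Cycle 0 transfer on bus B1 with hardware component U1.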

In the preferred embodiment, there would be II matrices, 
one each corresponding to each cycle in the initiation 
interval. These matrices thus identify each bus in the plu- 
rality of busses B and the hardware components U to which 
each bus is assigned for a given data transfer. These matrices
are also in effect truth tables indicating whether or not an 
assignment has been made by a scheduler or other operator 
between a bus and a hardware component. These matrices 
can then be used to identify bus assignments for which 
different operations occur on the same hardware 
components, and therefore which are candidates for opti- 
mizing bus interconnects. Once these bus assignments are 
identified, the interconnects can be combined, linked or 
otherwise merged, and combinations of hardware compo- 
nents scheduled for data transfers among themselves at 
different times can be assigned to the same bus. 
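
One way to read "combined, linked or otherwise merged" is as an element-wise OR of the per-cycle matrices, which is the same operation the description later uses for its accumulation matrix. The sketch below treats the number of 1s in that OR as the interconnect count, and shows that relabelling the busses of one cycle can lower it. The helper names and the tiny 2×3 matrices are hypothetical.

```python
def merge(matrices):
    """Element-wise OR of equally sized 0/1 matrices."""
    return [
        [int(any(bits)) for bits in zip(*rows)]
        for rows in zip(*matrices)
    ]


def interconnect_count(matrices):
    """Number of distinct bus-to-unit connections implied by all cycles together."""
    return sum(sum(row) for row in merge(matrices))


# Two cycles, two busses (rows), three hardware components (columns).
A0 = [[1, 1, 0],
      [0, 0, 1]]
A1 = [[0, 0, 1],
      [1, 1, 0]]

# Swapping the two bus labels in the second cycle leaves its schedule intact.
A1_RELABELLED = [A1[1], A1[0]]

if __name__ == "__main__":
    print(interconnect_count([A0, A1]))             # 6
    print(interconnect_count([A0, A1_RELABELLED]))  # 3
```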

To find an efficient way of identifying the bus assignments 
which are candidates for optimization, it is noted that an 
efficient solution can be derived through common algo- 
rithms for solving assignment problems where the initiation 
interval equals two. For example, where there are only two 
cycles in the initiation interval, an algorithm for solving 
assignment problems can be used to determine the optimum 
bus interconnect configuration to achieve the data transfers 
defined by the software code to be executed on the processor 
being designed. To extrapolate to the design where II is 
greater than 2, a solution is approximated by solving the 
assignment problem a number of times (FIGS. 17 and 18) to 
optimize the bus assignment represented by each matrix, 
after which the optimized bus assignments for each matrix 
are combined to produce a bus assignment and interconnect 
configuration for the processor. Additional passes are preferred
to optimize the bus connections for each matrix, so that the
interconnects are optimized as much as is reasonable under the
circumstances.

In one preferred embodiment of the inventions, the method of
optimizing the bus interconnects includes preparing (step 110 in
FIG. 17) representations of the bus assignments, for example
preparing representations of matrices A (FIGS. 9A-9D). A
correlation matrix (A_co in FIGS. 11 and 17) is produced from
two bus assignment matrices A_0 and A_1 by taking a cross
product (116 in FIG. 17) of the first bus assignment matrix
(A_0) with a transpose of the second matrix, A_1^T (FIG. 10).
The correlation matrix is used to solve an assignment problem to
optimize the bus assignment (116, FIG. 17). The assignment
problem is solved by maximizing the sum of the diagonal of the
correlation matrix (see FIG. 11), while permuting the columns of
the matrix. In other words, all of the permutations of the data
transfer assignments and interconnects for the cycle
corresponding to the matrix A_1 are analyzed, in order to
maximize the sum of the diagonal of the correlation matrix A_co.
Those skilled in the art will recognize that the actual process
of maximizing the values on the diagonal can be achieved in a
number of ways, but preferably by permuting (117) the columns of
the correlation matrix A_co to find the maximum for the trace of
the correlation matrix A_co. Once the desired permutation is
found, the corresponding permuted matrix for the cycle under
consideration is determined from the permuted correlation matrix
(118), and stored. The result is stored (121, FIG. 17) as the
optimum permuted bus assignment for the second matrix (A_1). It
should be noted that in the first pass, the matrix A_1 is
optimized only relative to the unpermuted bus assignment for the
first matrix A_0. However, the second pass will permute the A_0
bus assignments.

An accumulation matrix A_acc is first initialized at 112 (FIG.
17) and thereafter is updated at step 121 by OR-ing together the
first permuted matrix (A_1) from step 120 and the original
matrix (A_0). This accumulation matrix is then used to calculate
a new correlation matrix (returning to 116 after testing 122 for
the end of the sequence) along with the transpose of the next
matrix (A_{n+1}). The permutations of the next matrix are then
evaluated in the same way using the assignment problem algorithm
(117) until the permuted matrix is optimized and stored. The
newly permuted matrix is then OR-ed (121) with the accumulation
matrix A_acc to obtain a new accumulation matrix. The
accumulation matrix tracks the optimization results of the
permutations in the preceding steps and updates the matrix so
that each successive correlation matrix is current. The
accumulation matrix is used in the rest of the first pass to
optimize the remaining bus assignment matrices by repeatedly
solving the assignment problem algorithm (117; FIG. 17).
Specifically, throughout the rest of the first pass (FIG. 17),
an updated accumulation matrix is generated at each step by
OR-ing together the optimized matrices and the starting matrix.
It is believed that during the first pass, the accumulation
matrix represents a gradually increasing quality combination of
the previously optimized bus assignments.

An example of an improved correlation matrix developed by
permuting the bus assignment data for a given matrix can be seen
by comparing the results of the correlation matrices in FIG. 11
and FIG. 13. The correlation matrix of FIG. 11 is derived from
the cross product of the first matrix A_0 (FIG. 9A) with the
transpose of the second matrix (FIG. 10). The cross product
producing the correlation matrix in FIG. 11 does not have any
bus assignments in common between the two matrices, as evidenced
by the diagonal containing all zeros. However, if the second
matrix A_1 (FIG. 9B) is permuted and re-transposed (FIG. 12B),
the new bus B1 (the first column 124 of the transposed matrix)
has values indicating assignments that are common with bus B1 in
the first matrix A_0. As a result, the sum of the digits along
the diagonal of the correlation matrix in FIG. 13 is greater
than that for the correlation matrix in FIG. 11. Therefore,
permuting matrix A_1 improved the bus assignment configuration
for the second cycle, and will reduce the number of redundant
connections to the busses. Additionally, it is relatively
trivial to demonstrate that when the permutation is found, the
permutation maximizes the number of connections that are already
made, and that this number is the sum of the diagonal values in
the permuted matrix.

Assuming that the permuted second matrix represented by the
transpose in FIG. 12B is the optimum bus assignment
configuration for the second cycle, the permuted matrix is then
OR-ed with the first matrix to produce the accumulation matrix
126 shown in FIG. 14. The bus assignments of this accumulation
matrix represent the combined bus assignments derived from the
first matrix and the permuted second matrix.

After the second matrix of each cross product is optimized, the
optimized matrix is stored. The resulting optimized matrices
represent an improved bus assignment for the hardware components
for each cycle relative to the initial bus assignment
configuration. However, it has been found that a second pass
(FIG. 18) yields significantly improved results for bus
assignments.

During the second pass, a second accumulation matrix is derived
by OR-ing all of the permuted matrices except for the first
matrix (128; FIG. 18). The second accumulation matrix is then
multiplied by the transpose of the first matrix in order to
optimize the bus assignment configuration represented by the
first matrix. The optimization occurs in the same manner as
described with respect to the first pass, namely by solving the
assignment problem algorithm for the product of the new
accumulated matrix and the transpose of the first matrix (130;
FIG. 18). The bus definitions represented by the columns in the
transposed matrix are permuted until the optimum bus assignments
for the first matrix are found. The permuted bus assignments for
the first matrix are then stored (132; FIG. 18). The second
accumulation matrix is then recalculated at 128 using the new
permuted first matrix and all of the other permuted matrices
except for the permuted second matrix from the first pass. The
second matrix is then optimized (130) relative to the new second
accumulation matrix and then stored (132). All of the remaining
permuted matrices are re-optimized in the second pass in the
same manner (FIG. 18). The permuted matrices are then output
(134), representing optimized bus assignments for each cycle in
the initiation interval. The output can be combined into a
single definition of the bus interconnects with the hardware
components and thereafter used to create a description in a
hardware description language, along with the function unit
assignments and schedule. Alternatively, the optimized bus
assignments can be stored for later use in creating a hardware
description language description of the processor.

Optimization of the bus assignments produces a more efficient
bus interconnect configuration, resulting in a processor that is
less expensive to produce. Using an algorithm for solving
assignment problems in this context of optimizing bus
interconnect configurations leads to a less time-consuming and
more efficient method for designing processors. This is
especially applicable to design of processors using software
pipelined data paths.

Having thus described several exemplary implementations of the
invention, it will be apparent that various alterations and
modifications can be made without departing from the inventions
or the concepts discussed herein. Such alterations and
modifications, though not expressly described above, are
nonetheless intended and implied to be within the spirit and
scope of the inventions. Accordingly, the foregoing description
is intended to be illustrative only.
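
By way of illustration only, the two-pass procedure described with reference to FIGS. 17 and 18 can be sketched in a few lines of Python. Several things here are assumptions rather than details taken from the description: all function names are hypothetical, the "cross product" is read as an ordinary matrix product of one matrix with the transpose of another, and the assignment problem is solved by brute force over all B! permutations, which stands in for a polynomial-time solver such as the Hungarian method and is practical only for small B. The three sample matrices are invented for the example.

```python
from itertools import permutations


def correlation(acc, a):
    """acc times the transpose of a: entry [i][j] counts the units shared by
    bus i of acc and bus j of a."""
    return [
        [sum(x * y for x, y in zip(acc_row, a_row)) for a_row in a]
        for acc_row in acc
    ]


def best_permutation(corr):
    """Brute-force assignment solver: the column order that maximizes the trace."""
    n = len(corr)
    return max(
        permutations(range(n)),
        key=lambda p: sum(corr[i][p[i]] for i in range(n)),
    )


def permute_busses(a, perm):
    """Relabel busses so that old bus perm[i] becomes new bus i."""
    return [a[perm[i]] for i in range(len(a))]


def or_all(matrices):
    """Element-wise OR of equally sized 0/1 matrices (the accumulation matrix)."""
    return [
        [int(any(bits)) for bits in zip(*rows)]
        for rows in zip(*matrices)
    ]


def first_pass(matrices):
    """Optimize each matrix in turn against the accumulation of earlier ones."""
    result = [matrices[0]]
    acc = matrices[0]  # accumulation matrix starts as the unpermuted A_0
    for a in matrices[1:]:
        permuted = permute_busses(a, best_permutation(correlation(acc, a)))
        result.append(permuted)
        acc = or_all([acc, permuted])
    return result


def second_pass(matrices):
    """Re-optimize every matrix, including the first, against all the others."""
    result = list(matrices)
    for k in range(len(result)):
        others = result[:k] + result[k + 1:]
        if not others:
            break  # a single cycle leaves nothing to optimize against
        acc = or_all(others)
        perm = best_permutation(correlation(acc, result[k]))
        result[k] = permute_busses(result[k], perm)
    return result


# Three cycles, two busses (rows), three hardware components (columns).
A0 = [[1, 1, 0], [0, 0, 1]]
A1 = [[0, 0, 1], [1, 1, 0]]
A2 = [[0, 0, 1], [1, 0, 0]]

OPTIMIZED = second_pass(first_pass([A0, A1, A2]))

if __name__ == "__main__":
    print(sum(map(sum, or_all([A0, A1, A2]))))  # 6 connections before
    print(sum(map(sum, or_all(OPTIMIZED))))     # 3 connections after
```

In this small example the correlation of A0 with A1 has an all-zero diagonal, as in the FIG. 11 discussion, and swapping the two bus labels in the second cycle raises the trace to 3. The second pass happens to change nothing here; on larger inputs it gives the first matrix a chance to be relabelled as well.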
What is claimed is: 

1. A method of assembling a set of hardware components 
and bus assemblies, the method comprising the steps of: 

an identifying step identifying each bus in a plurality of
busses and at least one hardware component to which 
each bus is assigned for a given operation; 

another identifying step identifying at least two bus 
assignments for which different operations occur on the 
same hardware component; and

assigning at least some of the hardware components for 
which different operations occur on the same hardware 
component to the same bus. 

2. The method of claim 1 wherein the second recited
identifying step includes the step of identifying at least two 
bus assignments for which different operations occur on the 
same hardware component at different times. 

3. The method of claim 1 wherein the first recited iden-
tifying step includes the step of preparing a table of bus
assignments. 

4. The method of claim 3 wherein the step of preparing a 
table includes the step of preparing a plurality of matrices 
representing bus assignments. 

5. The method of claim 4 wherein the step of preparing a
table includes the step of comparing different matrices 
corresponding to different cycle times. 

6. The method of claim 4 wherein the step of assigning 
includes the step of solving an assignment problem algo-
rithm.

7. The method of claim 6 wherein the step of solving 
includes the step of permuting bus assignments in the table 
of bus assignments. 

8. The method of claim 4 further comprising the steps of 
calculating a cross product of a first matrix and the transpose
of a second matrix and maximizing the sum of values on the 
diagonal of the resultant matrix. 

9. The method of claim 8 wherein the step of maximizing 
includes the step of permuting representations of bus assign-
ments for the second matrix.

10. The method of claim 9 further comprising the step of 
saving a permuted matrix. 

11. The method of claim 10 further comprising the step of 
determining an accumulation matrix representing a combi-
nation of at least two bus assignment matrices.

12. The method of claim 11 wherein the step of deter-
mining an accumulation matrix includes the step of OR-ing
a permuted matrix with another matrix. 

13. The method of claim 12 wherein the step of calcu- 
lating a cross product includes the step of calculating a cross 
product for every matrix representing a bus assignment, and 
wherein the step of determining an accumulation matrix 
includes the step of determining an accumulation matrix 
using each permuted matrix. 

14. The method of claim 9 wherein the step of maximizing 
includes the step of determining an accumulation matrix 
from a plurality of permuted matrices and maximizing a 
cross product between a permuted matrix and the accumu- 
lation matrix determined from the plurality of permuted 
matrices. 

15. A method of selecting a set of hardware component 
and bus assembly interconnections, the method comprising 
the steps of: 

identifying each bus in a plurality of busses and at least 
one hardware component to which each bus is assigned 
for a given operation, and representing each bus and the 
respective assigned hardware component in a table; 

identifying for each bus in the plurality of busses a time 
in a sequence of times when the bus is to be connected 
to the hardware component; and 

combining at least two busses that are connected to the 
same hardware component at different times into a 
single bus. 

16. The method of claim 15 wherein the step of combining 
includes the step of permuting the representations of each 
bus in the table. 

17. The method of claim 15 further comprising the step of 
representing each bus and hardware components in discrete 
tables representing different cycle times. 

18. A system for selecting a set of hardware components 
and bus assembly interconnections, the system comprising: 

a processor containing an algorithm for solving assign- 
ment problems; 

means for identifying a plurality of bus assignments 
corresponding to hardware components and permuting 
the bus assignments for optimizing the number of bus 
assignments corresponding to hardware components; 
and 

producing representations of bus assignments correspond- 
ing to hardware components. 

*****